Algorithms for Binary Neural Networks
where $\theta$ and $\lambda$ are hyperparameters, $\vec{M} = \{M_1, \ldots, M_N\}$ are the M-Filters, and $\hat{C}$ is the binarized filter set across all layers. The operation $\circ$ defined in Eq. 3.12 approximates unbinarized filters from binarized filters and M-Filters, which leads to the filter loss as the first term on the right of Eq. 3.18. The second term on the right is similar to the center loss used to evaluate intra-class compactness; it deals with the feature variation caused by the binarization process. $f_m(\hat{C}, \vec{M})$ denotes the feature map of the last convolutional layer for the $m$th sample, and $f(\hat{C}, \vec{M})$ denotes the class-specific mean feature map of the previous samples. We note that the center loss has been successfully deployed to handle feature variations. After training, we keep only the binarized filters and the shared M-Filters (which are quite small) to calculate the feature maps, reducing the storage requirement. We then consider the conventional loss and define a new loss function $L_{S,M} = L_S + L_M$, where $L_S$ is a conventional loss function, e.g., the softmax loss.
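To make the structure of $L_M$ concrete, here is a minimal NumPy sketch of its two terms. The shapes, the elementwise stand-in for $\circ$, and the helper names are illustrative assumptions, not the actual MCN implementation:

```python
import numpy as np

def filter_loss(C, C_hat, M, theta):
    # First term of Eq. 3.18: theta/2 * sum ||C - C_hat o M||^2,
    # with the MCN operation "o" simplified here to an elementwise product.
    return 0.5 * theta * np.sum((C - C_hat * M) ** 2)

def compactness_loss(feats, labels, lam):
    # Second term: a center-loss-style penalty pulling each sample's
    # feature map toward the mean feature map of its class.
    loss = 0.0
    for c in np.unique(labels):
        fc = feats[labels == c]
        loss += np.sum((fc - fc.mean(axis=0)) ** 2)
    return 0.5 * lam * loss
```

Both terms vanish exactly when the reconstruction is perfect and all features within a class coincide, which matches the intent of the two penalties.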
Again, we consider the quantization process in our loss $L_{S,M}$ and obtain the final minimization objective:
$$L(C, \hat{C}, M) = L_{S,M} + \frac{\theta}{2}\left\| C^{[k]} - \hat{C} - \eta\,\delta_C^{[k]} \right\|^2, \tag{3.19}$$
where $\theta$ is shared with Eq. 3.18 to reduce the number of parameters, and $\delta_C^{[k]}$ is the gradient of $L_{S,M}$ with respect to $C^{[k]}$. Unlike conventional methods (such as XNOR-Net), where only the filter reconstruction is considered in the weight calculation, our discrete optimization method provides a comprehensive way to calculate binarized CNNs by considering the filter loss, the softmax loss, and feature compactness in a unified framework.
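As a small numerical illustration of the quantization term in Eq. 3.19 (the function name and shapes are arbitrary; the network losses in $L_{S,M}$ are omitted):

```python
import numpy as np

def quantization_term(C_hat, C_k, delta_C_k, theta, eta):
    # theta/2 * || C^[k] - C_hat - eta * delta_C^[k] ||^2  (Eq. 3.19)
    return 0.5 * theta * np.sum((C_k - C_hat - eta * delta_C_k) ** 2)
```

The term is zero when the binarized filters coincide with the gradient-updated unbinarized filters, so minimizing it pulls $\hat{C}$ toward the updated $C$.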
3.4.3 Back-Propagation Updating
In MCNs, the unbinarized filters $C_i$ and the M-Filters $M$ must both be learned and updated. The two types of filters are learned jointly: in each convolutional layer, MCNs sequentially update the unbinarized filters and then the M-Filters.
Updating unbinarized filters: The gradient $\delta_{\hat{C}}$ corresponding to $C_i$ is defined as
$$\delta_{\hat{C}} = \frac{\partial L}{\partial \hat{C}_i} = \frac{\partial L_S}{\partial \hat{C}_i} + \frac{\partial L_M}{\partial \hat{C}_i} + \theta\left(\hat{C}^{[k]} - C^{[k]} - \eta_1\,\delta_C^{[k]}\right), \tag{3.20}$$
$$C_i \leftarrow C_i - \eta_1\,\delta_{\hat{C}}, \tag{3.21}$$
where $L$, $L_S$, and $L_M$ are the loss functions defined above, and $\eta_1$ is the learning rate. Furthermore, we have the following:
$$\frac{\partial L_S}{\partial \hat{C}_i} = \frac{\partial L_S}{\partial Q} \cdot \frac{\partial Q}{\partial \hat{C}_i} = \sum_j \frac{\partial L_S}{\partial Q_{ij}} \cdot M'_j, \tag{3.22}$$
$$\frac{\partial L_M}{\partial \hat{C}_i} = \theta \sum_j \left(C_i - \hat{C}_i \circ M_j\right) \circ M_j. \tag{3.23}$$
Updating M-Filters: We further update the M-Filter $M$ with $C$ fixed. $\delta_M$ is defined as the gradient with respect to $M$, and we have:
$$\delta_M = \frac{\partial L}{\partial M} = \frac{\partial L_S}{\partial M} + \frac{\partial L_M}{\partial M}, \tag{3.24}$$
$$M \leftarrow \left|M - \eta_2\,\delta_M\right|, \tag{3.25}$$